The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. This is a historic moment, scientifically and analytically: a large amount of data is being shared across the globe and analyzed in real time, and the world has come to a halt because of it. Some interesting questions may fall out of this.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that the reported number of cases conflicted depending on the source. Originally I was using JHU data, but it always ran ‘ahead’ of Our World In Data, and JHU’s website was buggy: clicking the U.S. stats did not actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Thu May 28 20:57:29 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ itself but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Thu May 28 20:57:29 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name       description
prediction_model    outputs a case estimate for given log-linear model parameters (slope and intercept)
make_long           converts input data to long format (specialized cases)
name_overlaps       outputs the column-name intersection and set differences of two data frames
find_linear_index   finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b          # model is linear on the log10 scale
  total_cases<-10^total_cases.log    # back-transform to a case count
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
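As a standalone illustration of the log-linear idea behind prediction_model (toy numbers, not the project's data): fit log10(cases) against days, then back-transform with 10^ to recover a case count.

```r
# Hypothetical data: exponential growth at 35%/day starting from 100 cases
days  <- 0:9
cases <- 100 * 1.35^days

# Fit on the log10 scale, where exponential growth is a straight line
fit <- lm(log10(cases) ~ days)
m <- unname(coef(fit)["days"])
b <- unname(coef(fit)["(Intercept)"])

# Back-transform to predict the raw case count on day 15
predicted_cases <- 10^(m * 15 + b)
```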

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as the variable/value pair. Also enforces standardization (assumes the data-structure naming) via fixed variable.name, value.name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                    value.name = "Total_confirmed_cases",
                    id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){
  long_data<-melt(data_in,
                  id.vars = id.vars,
                  variable.name=variable.name,
                  value.name=value.name)
  return(long_data)
}
##------------------------------------------
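For reference, the melt() call inside make_long behaves as follows on a made-up two-date frame mimicking the JHU layout:

```r
library(reshape2)

# Toy wide-format frame (one column per date, as JHU ships it)
toy <- data.frame(case_type = "total", Province.State = "Pennsylvania",
                  Country.Region = "US_state", Lat = 40.3, Long = -75.1,
                  City = "Bucks", Population = NA,
                  X3.20.20 = 10, X3.21.20 = 15)

toy.long <- melt(toy,
                 id.vars = c("case_type", "Province.State", "Country.Region",
                             "Lat", "Long", "City", "Population"),
                 variable.name = "Date",
                 value.name = "Total_confirmed_cases")
# one row per (region, date) pair; dates still carry the raw "X" prefix
```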

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------
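The underlying set operations are just intersect()/setdiff() on the name vectors; a minimal sketch with made-up frames:

```r
df1 <- data.frame(a = 1, b = 2)
df2 <- data.frame(b = 3, c = 4)

shared   <- intersect(names(df1), names(df2))  # in both
only_df1 <- setdiff(names(df1), names(df2))    # in df1 only
only_df2 <- setdiff(names(df2), names(df1))    # in df2 only
```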

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last running_avg second-derivative points are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    n_stable<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-n_stable
  }
  
  # Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
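The core idea behind find_linear_index: on the log2 scale exponential growth is a straight line, so its second difference sits near zero once growth is steady. A self-contained sketch on synthetic, noise-free doubling (not the project data):

```r
# Cases double every day, so log2(cases) increases by exactly 1 per day
cases <- 2^(1:15)
second_diff <- diff(diff(log2(cases)))

# Every second difference is (numerically) zero, i.e. within the 0.2
# tolerance used by find_linear_index
stable <- all(abs(second_diff) < 0.2)
```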

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X5.27.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X5.27.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X5.27.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X5.27.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# Oddly, Population is listed in the US deaths file but not in the other datasets. When combining, all datasets will get this column, but US deaths is the only meaningful one.
Corona_Totals.US.raw$Population<-"NA"  # note: the string "NA", not NA; coerced to numeric later (with a warning)
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # world-level data lacks City; add a placeholder to match the US columns
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
# JHU has a gross file format: it's wide, with one column per date in MM/DD/YY. Read it in as raw data but transform it to be better suited for analysis.
# Furthermore, the world- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
                         make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # read.csv prefixes digit-leading column names with "X"; swap for a leading 0 (assumes single-digit months)
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
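A quick sanity check of this transformation on one made-up column name (%y still parses correctly because strptime ignores trailing text):

```r
x <- "X1.22.20"                                      # raw column name from read.csv
x <- gsub(x, pattern = "^X", replacement = "0")      # "01.22.20"
x <- gsub(x, pattern = "20$", replacement = "2020")  # "01.22.2020"
d <- as.Date(x, format = "%m.%d.%y")                 # %y grabs "20" -> year 2020
```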

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
Country.Region  death  total
Afghanistan 127 127
Albania 127 127
Algeria 127 127
Andorra 127 127
Angola 127 127
Antigua and Barbuda 127 127
Argentina 127 127
Armenia 127 127
Australia 1016 1016
Austria 127 127
Azerbaijan 127 127
Bahamas 127 127
Bahrain 127 127
Bangladesh 127 127
Barbados 127 127
Belarus 127 127
Belgium 127 127
Belize 127 127
Benin 127 127
Bhutan 127 127
Bolivia 127 127
Bosnia and Herzegovina 127 127
Botswana 127 127
Brazil 127 127
Brunei 127 127
Bulgaria 127 127
Burkina Faso 127 127
Burma 127 127
Burundi 127 127
Cabo Verde 127 127
Cambodia 127 127
Cameroon 127 127
Canada 1778 1778
Central African Republic 127 127
Chad 127 127
Chile 127 127
China 4191 4191
Colombia 127 127
Comoros 127 127
Congo (Brazzaville) 127 127
Congo (Kinshasa) 127 127
Costa Rica 127 127
Cote d’Ivoire 127 127
Croatia 127 127
Cuba 127 127
Cyprus 127 127
Czechia 127 127
Denmark 381 381
Diamond Princess 127 127
Djibouti 127 127
Dominica 127 127
Dominican Republic 127 127
Ecuador 127 127
Egypt 127 127
El Salvador 127 127
Equatorial Guinea 127 127
Eritrea 127 127
Estonia 127 127
Eswatini 127 127
Ethiopia 127 127
Fiji 127 127
Finland 127 127
France 1397 1397
Gabon 127 127
Gambia 127 127
Georgia 127 127
Germany 127 127
Ghana 127 127
Greece 127 127
Grenada 127 127
Guatemala 127 127
Guinea 127 127
Guinea-Bissau 127 127
Guyana 127 127
Haiti 127 127
Holy See 127 127
Honduras 127 127
Hungary 127 127
Iceland 127 127
India 127 127
Indonesia 127 127
Iran 127 127
Iraq 127 127
Ireland 127 127
Israel 127 127
Italy 127 127
Jamaica 127 127
Japan 127 127
Jordan 127 127
Kazakhstan 127 127
Kenya 127 127
Korea, South 127 127
Kosovo 127 127
Kuwait 127 127
Kyrgyzstan 127 127
Laos 127 127
Latvia 127 127
Lebanon 127 127
Lesotho 127 127
Liberia 127 127
Libya 127 127
Liechtenstein 127 127
Lithuania 127 127
Luxembourg 127 127
Madagascar 127 127
Malawi 127 127
Malaysia 127 127
Maldives 127 127
Mali 127 127
Malta 127 127
Mauritania 127 127
Mauritius 127 127
Mexico 127 127
Moldova 127 127
Monaco 127 127
Mongolia 127 127
Montenegro 127 127
Morocco 127 127
Mozambique 127 127
MS Zaandam 127 127
Namibia 127 127
Nepal 127 127
Netherlands 635 635
New Zealand 127 127
Nicaragua 127 127
Niger 127 127
Nigeria 127 127
North Macedonia 127 127
Norway 127 127
Oman 127 127
Pakistan 127 127
Panama 127 127
Papua New Guinea 127 127
Paraguay 127 127
Peru 127 127
Philippines 127 127
Poland 127 127
Portugal 127 127
Qatar 127 127
Romania 127 127
Russia 127 127
Rwanda 127 127
Saint Kitts and Nevis 127 127
Saint Lucia 127 127
Saint Vincent and the Grenadines 127 127
San Marino 127 127
Sao Tome and Principe 127 127
Saudi Arabia 127 127
Senegal 127 127
Serbia 127 127
Seychelles 127 127
Sierra Leone 127 127
Singapore 127 127
Slovakia 127 127
Slovenia 127 127
Somalia 127 127
South Africa 127 127
South Sudan 127 127
Spain 127 127
Sri Lanka 127 127
Sudan 127 127
Suriname 127 127
Sweden 127 127
Switzerland 127 127
Syria 127 127
Taiwan* 127 127
Tajikistan 127 127
Tanzania 127 127
Thailand 127 127
Timor-Leste 127 127
Togo 127 127
Trinidad and Tobago 127 127
Tunisia 127 127
Turkey 127 127
Uganda 127 127
Ukraine 127 127
United Arab Emirates 127 127
United Kingdom 1397 1397
Uruguay 127 127
US 127 127
US_state 414147 414147
Uzbekistan 127 127
Venezuela 127 127
Vietnam 127 127
West Bank and Gaza 127 127
Western Sahara 127 127
Yemen 127 127
Zambia 127 127
Zimbabwe 127 127
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized at the country level; collapse all but the US
Corona_Cases.long<-rbind.fill(
  ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),
        c("case_type","Country.Region","Date","Date.numeric"),
        summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),
  filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only verified that case100 works for the U.S., so case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
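The same threshold-alignment pattern, sketched on a made-up two-region frame:

```r
library(plyr)

toy <- data.frame(Country.Region = rep(c("A", "B"), each = 3),
                  Date.numeric   = rep(18300:18302, 2),
                  Total_confirmed_cases = c(50, 150, 400, 120, 300, 900))

# First date each region exceeded 100 cumulative cases
first100 <- ddply(subset(toy, Total_confirmed_cases > 100),
                  "Country.Region", summarise,
                  case100_date = min(Date.numeric))

toy <- merge(toy, first100)
toy$Days_since_100 <- toy$Date.numeric - toy$case100_date
# Region A crosses 100 a day later than B, so its curve is shifted by one day
```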

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Province.State  Freq
Alabama 4333
Alaska 807
Arizona 1091
Arkansas 4530
California 4161
Colorado 3947
Connecticut 646
Delaware 267
Diamond Princess 72
District of Columbia 73
Florida 4634
Georgia 10332
Grand Princess 73
Guam 73
Hawaii 377
Idaho 2059
Illinois 5910
Indiana 6009
Iowa 5482
Kansas 4529
Kentucky 6591
Louisiana 4345
Maine 1101
Maryland 1697
Massachusetts 1099
Michigan 5247
Minnesota 4928
Mississippi 5386
Missouri 5942
Montana 1850
Nebraska 3398
Nevada 821
New Hampshire 763
New Jersey 1641
New Mexico 1812
New York 4149
North Carolina 6336
North Dakota 2134
Northern Mariana Islands 58
Ohio 5653
Oklahoma 4268
Oregon 2175
Pennsylvania 4467
Puerto Rico 73
Rhode Island 430
South Carolina 3145
South Dakota 2711
Tennessee 6107
Texas 12776
Utah 1006
Vermont 1020
Virgin Islands 73
Virginia 7982
Washington 2839
West Virginia 2906
Wisconsin 4274
Wyoming 1320
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question | dataset | x | y | color | facet | pch | dimensions
comparative longitudinal case trend (log) | long | time | log_cases | geography | none (case type?) | case_type | [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend | long | time | cases | geography | case_type | ? | [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend | wide | time | mortality rate | geography | none | none | [15, 50, 4] geography
death vs total correlation | wide | cases | deaths | geography | none | none | [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm to each region starting from the 100th case; m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



# total/death counts differ by ~2 logs, so when plotting on the log scale, use pch; when plotting absolute counts, use free scales
# when plotting deaths and cases on the same plot, melt first

# CoronaCases -> filter sets (3)
  # world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 39613
China 127
Diamond Princess 108
Korea, South 98
Japan 97
Italy 95
Iran 92
Singapore 89
France 88
Germany 88
Spain 87
US 86
Switzerland 84
United Kingdom 84
Belgium 83
Netherlands 83
Norway 83
Sweden 83
Austria 81
Malaysia 80
Australia 79
Bahrain 79
Denmark 79
Canada 78
Qatar 78
Iceland 77
Brazil 76
Czechia 76
Finland 76
Greece 76
Iraq 76
Israel 76
Portugal 76
Slovenia 76
Egypt 75
Estonia 75
India 75
Ireland 75
Kuwait 75
Philippines 75
Poland 75
Romania 75
Saudi Arabia 75
Indonesia 74
Lebanon 74
San Marino 74
Thailand 74
Chile 73
Pakistan 73
Luxembourg 72
Peru 72
Russia 72
Ecuador 71
Mexico 71
Slovakia 71
South Africa 71
United Arab Emirates 71
Armenia 70
Colombia 70
Croatia 70
Panama 70
Serbia 70
Taiwan* 70
Turkey 70
Argentina 69
Bulgaria 69
Latvia 69
Uruguay 69
Algeria 68
Costa Rica 68
Dominican Republic 68
Hungary 68
Andorra 67
Bosnia and Herzegovina 67
Jordan 67
Lithuania 67
Morocco 67
New Zealand 67
North Macedonia 67
Vietnam 67
Albania 66
Cyprus 66
Malta 66
Moldova 66
Brunei 65
Burkina Faso 65
Sri Lanka 65
Tunisia 65
Ukraine 64
Azerbaijan 63
Ghana 63
Kazakhstan 63
Oman 63
Senegal 63
Venezuela 63
Afghanistan 62
Cote d’Ivoire 62
Cuba 61
Mauritius 61
Uzbekistan 61
Cambodia 60
Cameroon 60
Honduras 60
Nigeria 60
West Bank and Gaza 60
Belarus 59
Georgia 59
Bolivia 58
Kosovo 58
Kyrgyzstan 58
Montenegro 58
Congo (Kinshasa) 57
Kenya 56
Niger 55
Guinea 54
Rwanda 54
Trinidad and Tobago 54
Paraguay 53
Bangladesh 52
Djibouti 50
El Salvador 49
Guatemala 48
Madagascar 47
Mali 46
Congo (Brazzaville) 43
Jamaica 43
Gabon 41
Somalia 41
Tanzania 41
Ethiopia 40
Burma 39
Sudan 38
Liberia 37
Maldives 35
Equatorial Guinea 34
Cabo Verde 32
Sierra Leone 30
Guinea-Bissau 29
Togo 29
Zambia 28
Eswatini 27
Chad 26
Tajikistan 25
Haiti 23
Sao Tome and Principe 23
Benin 21
Nepal 21
Uganda 21
Central African Republic 20
South Sudan 20
Guyana 18
Mozambique 17
Yemen 13
Mongolia 12
Mauritania 9
Nicaragua 9
Malawi 3
Syria 3
Zimbabwe 1
# Pick top 15 countries with data
max_colors<-12
# TODO: find a way to fix this: China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 127
Korea, South 98
Japan 97
Italy 95
Iran 92
Singapore 89
France 88
Germany 88
Spain 87
US 86
Switzerland 84
United Kingdom 84
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-04-27        18379
## 2    Afghanistan           <NA> <NA> 2020-04-26        18378
## 3    Afghanistan           <NA> <NA> 2020-03-02        18323
## 4    Afghanistan           <NA> <NA> 2020-05-18        18400
## 5    Afghanistan           <NA> <NA> 2020-03-03        18324
## 6    Afghanistan           <NA> <NA> 2020-04-05        18357
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                     57                  1703     0.03347035
## 2                     50                  1531     0.03265839
## 3                      0                     1     0.00000000
## 4                    173                  7072     0.02446267
## 5                      0                     1     0.00000000
## 6                      7                   349     0.02005731
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  3.231215                   1.755875        18348
## 2                  3.184975                   1.698970        18348
## 3                  0.000000                       -Inf        18348
## 4                  3.849542                   2.238046        18348
## 5                  0.000000                       -Inf        18348
## 6                  2.542825                   0.845098        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             31  NA   NA         NA                           NA
## 2             30  NA   NA         NA                           NA
## 3            -25  NA   NA         NA                           NA
## 4             52  NA   NA         NA                           NA
## 5            -24  NA   NA         NA                           NA
## 6              9  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)
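The do.call pattern above keeps one shared argument list and swaps only the data in; minimally (toy frame, assumed column names):

```r
library(reshape2)

# Shared melt() arguments, reused across several data frames
melt_args <- list(variable.name = "case_type", value.name = "cases",
                  measure.vars  = c("Total_confirmed_cases", "Total_confirmed_deaths"))

toy <- data.frame(Country.Region = "A",
                  Total_confirmed_cases = 100, Total_confirmed_deaths = 3)

melt_args$data <- toy
toy.long <- do.call(melt, melt_args)  # one row per case_type
```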


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs. date becomes linear in each plot
  2. Fit a linear model for each state
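The "becomes linear" step is handled by `find_linear_index`, defined upstream of this section. A minimal base-R sketch of the idea — approximate the second derivative with second differences and take the first index where it is about zero — is below; the function name, tolerance, and toy data are illustrative assumptions, not the actual definition:

```r
# Sketch of the idea behind find_linear_index (hypothetical implementation):
# use second differences as a discrete second derivative and return the first
# date at which curvature flattens out, i.e. where growth is roughly linear.
find_linear_start <- function(dates, cases, tol = 0.05) {
  d2 <- diff(cases, differences = 2)      # discrete second derivative
  flat <- abs(d2) <= tol * max(abs(d2))   # "about 0", relative to peak curvature
  idx <- which(flat)[1]                   # first index where curvature flattens
  if (is.na(idx)) return(dates[1])
  dates[idx]
}

# Toy series: quadratic growth that turns linear after day 5
x <- 1:10
y <- c(x[1:5]^2, 25 + 10 * (1:5))
find_linear_start(x, y)   # 5
```

From that date onward, an ordinary `lm` fit (as in the loop below) is a reasonable description of the trend.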

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually fix states so that Maryland is switched out for New York
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")
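`prediction_model` is defined upstream of this report; a minimal sketch consistent with how it is called here — evaluate y = m·day + b on the fitted scale and return the column names the prediction loop expects — would be the following (an assumption for illustration, not the actual definition):

```r
# Hypothetical sketch of prediction_model(): evaluate the fitted line at the
# requested (numeric) days and return the columns the prediction loop expects
prediction_model <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}

# e.g. a state growing by 2 cases/day from a baseline of 100:
prediction_model(m = 2, b = 100, days = c(0, 1, 7))
```

Because `days` is numeric here, the caller converts it back to a calendar date afterwards with `as.Date(days, origin = "1970-01-01")`, as done above.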

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted COVID-19 case count based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, given reporting delays?
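The historical-performance question can be probed by back-testing: refit the model using only data up to day x, predict the next few days, and compare against the held-out actuals (the commented-out block further below does exactly this against the US data). A minimal self-contained sketch with toy numbers:

```r
# Back-testing sketch with toy data (real data frames are built upstream):
# train on history up to day x, predict days x..x+2, compare to actuals.
set.seed(1)
days <- 0:20
log_cases <- 2 + 0.12 * days + rnorm(21, sd = 0.01)  # roughly log-linear toy series

x <- 15                                        # pretend "today" is day 15
fit <- lm(log_cases ~ days, subset = days < x) # fit on history only
pred <- predict(fit, newdata = data.frame(days = x + 0:2))
actual <- log_cases[days %in% (x + 0:2)]
round(pred - actual, 3)                        # residuals: near zero when the trend holds
```

Large residuals in such a comparison would indicate the log-linear assumption breaking down (e.g. a plateau beginning).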

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
# Toy data to exercise the per-group model fit: two groups with different slopes
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)

# Fit one linear model per group and return its intercept (b) and slope (m)
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  names(case_models.parameters)[2:3] <- c("b","m")
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

# NOTE: historical_model_predictions.withHx is built by the commented-out block above
(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for: * Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/
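The mobility values scraped from the PDF reports arrive as percent strings (e.g. "-45%"), so they are stripped and coerced to numeric before plotting. The conversion used in this section can be sketched as follows (the helper name is illustrative):

```r
# Percent strings from the scraped reports must be stripped and coerced to numeric
as_change <- function(x) as.numeric(gsub("%", "", x, fixed = TRUE))
as_change(c("-45%", "20%", "0%"))   # -45 20 0
```

Missing entries pass through as `NA`, which is why the `filter(!is.na(value))` step below is needed before plotting.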

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit -1.0000000 -63.0
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9888281 -56.0
Hawaii Grocery_Pharmacy 0.9609603 -34.0
New Hampshire Parks 0.9571465 -20.0
Maine Transit -0.9046724 -50.0
Connecticut Grocery_Pharmacy -0.9015716 -6.0
Alaska Residential 0.8872480 13.0
Utah Residential -0.8712573 12.0
South Dakota Parks 0.8633000 -26.0
Vermont Parks 0.8564863 -35.5
Alaska Grocery_Pharmacy -0.8062819 -7.0
Hawaii Residential -0.8054281 19.0
Wyoming Parks -0.7938901 -4.0
Utah Transit -0.7862642 -18.0
Massachusetts Workplace -0.7613693 -39.0
Connecticut Transit -0.7575427 -50.0
Rhode Island Workplace -0.7523053 -39.5
Alaska Workplace -0.7314780 -34.0
Wyoming Transit -0.7198774 -17.0
Utah Parks -0.6859819 17.0
Hawaii Parks 0.6568936 -72.0
Vermont Grocery_Pharmacy -0.6518199 -25.0
Maine Workplace -0.6517954 -30.0
Utah Workplace -0.6482783 -37.0
New York Workplace -0.6471123 -34.5
Rhode Island Retail_Recreation -0.6339875 -45.0
Arizona Grocery_Pharmacy -0.6317509 -15.0
Montana Workplace -0.6124922 -40.5
Rhode Island Residential -0.6118197 18.5
Nebraska Workplace 0.6052030 -32.5
New Jersey Parks -0.5951073 -6.0
New Jersey Workplace -0.5948684 -44.0
Hawaii Transit 0.5926936 -89.0
New York Retail_Recreation -0.5871326 -46.0
Hawaii Workplace 0.5670535 -46.0
North Dakota Retail_Recreation -0.5348122 -42.0
Connecticut Residential 0.5327810 14.0
New York Parks 0.5273782 20.0
North Dakota Parks 0.5165068 -34.0
Massachusetts Retail_Recreation -0.5152581 -44.0
Connecticut Retail_Recreation -0.5122778 -45.0
Maine Parks 0.5062076 -31.0
West Virginia Parks 0.5022227 -33.0
Arizona Retail_Recreation -0.4981496 -42.5
New Jersey Retail_Recreation -0.4962379 -62.5
Connecticut Workplace -0.4922282 -39.0
Montana Parks -0.4913929 -58.0
Nebraska Residential -0.4838934 14.0
New Jersey Grocery_Pharmacy -0.4824036 2.5
Wyoming Workplace -0.4811244 -31.0
Iowa Parks -0.4750891 28.5
New Mexico Grocery_Pharmacy -0.4720830 -11.0
Montana Residential 0.4701424 14.0
Rhode Island Parks 0.4695783 52.0
New Mexico Residential 0.4486095 13.5
Illinois Transit -0.4484600 -31.0
New Mexico Parks 0.4468408 -31.5
Kansas Parks 0.4427794 72.0
Vermont Residential 0.4356654 11.5
Pennsylvania Workplace -0.4293474 -36.0
South Carolina Workplace 0.4271542 -30.0
New Jersey Transit -0.4229413 -50.5
California Transit -0.4199108 -42.0
Massachusetts Grocery_Pharmacy -0.4174309 -7.0
Arizona Residential 0.4149626 13.0
Kentucky Parks -0.4148083 28.5
New Hampshire Residential -0.4147077 14.0
California Residential 0.4088591 14.0
Alabama Grocery_Pharmacy -0.4000034 -2.0
Maryland Workplace -0.3992353 -35.0
Maryland Grocery_Pharmacy -0.3983441 -10.0
Nevada Transit -0.3980929 -20.0
Montana Retail_Recreation -0.3968324 -51.0
Idaho Workplace -0.3966924 -29.5
Wisconsin Transit -0.3943750 -23.5
Idaho Grocery_Pharmacy -0.3938409 -5.0
Arizona Transit 0.3857316 -38.0
Idaho Transit -0.3842822 -30.0
Alabama Workplace -0.3790124 -29.0
New York Transit -0.3729709 -48.0
Pennsylvania Retail_Recreation -0.3713051 -45.0
Wyoming Grocery_Pharmacy -0.3666843 -10.0
New Mexico Retail_Recreation -0.3498572 -42.5
Montana Transit -0.3476216 -41.0
Michigan Parks 0.3427390 30.0
Nebraska Grocery_Pharmacy 0.3396076 -0.5
Florida Residential 0.3285895 14.0
California Parks -0.3274596 -38.5
Alaska Retail_Recreation 0.3245901 -39.0
Alabama Transit -0.3218371 -36.5
North Dakota Workplace 0.3144871 -40.0
Montana Grocery_Pharmacy -0.3142629 -16.0
Idaho Residential -0.3123942 11.0
Minnesota Transit -0.3116720 -28.5
Maine Retail_Recreation -0.3099377 -42.0
North Carolina Grocery_Pharmacy 0.3092447 0.0
Vermont Retail_Recreation 0.3026708 -57.0
Pennsylvania Parks 0.2902198 12.0
Mississippi Residential 0.2892632 13.0
Virginia Transit -0.2885140 -33.0
Colorado Residential 0.2856790 14.0
Maryland Retail_Recreation -0.2849017 -39.0
Idaho Retail_Recreation -0.2829552 -40.0
Arkansas Retail_Recreation -0.2790183 -30.0
North Carolina Workplace 0.2716219 -31.0
Rhode Island Transit -0.2712873 -56.0
Texas Residential -0.2711555 15.0
Kansas Workplace 0.2704042 -32.5
Vermont Workplace -0.2681671 -43.0
Utah Retail_Recreation -0.2641293 -40.0
Maryland Residential 0.2639370 15.0
Nevada Residential 0.2635278 17.0
Oregon Grocery_Pharmacy 0.2598638 -7.0
Nevada Retail_Recreation -0.2573471 -43.0
Arkansas Parks -0.2572279 -12.0
West Virginia Grocery_Pharmacy -0.2569323 -6.0
Illinois Workplace -0.2539939 -31.0
Texas Workplace 0.2519294 -32.0
Rhode Island Grocery_Pharmacy 0.2509044 -7.5
Tennessee Workplace -0.2505811 -31.0
Illinois Parks 0.2478125 26.5
Wisconsin Parks 0.2444354 51.5
Florida Parks -0.2441952 -43.0
Tennessee Residential 0.2430082 11.5
South Carolina Parks -0.2427975 -23.0
Texas Parks 0.2400364 -42.0
California Grocery_Pharmacy -0.2384177 -11.5
California Retail_Recreation -0.2347245 -44.0
New York Grocery_Pharmacy -0.2317197 8.0
Georgia Grocery_Pharmacy -0.2312113 -10.0
Missouri Residential -0.2308773 13.0
Pennsylvania Grocery_Pharmacy -0.2300072 -6.0
Washington Workplace -0.2261964 -38.0
Arkansas Residential 0.2256138 12.0
California Workplace -0.2099594 -36.0
North Carolina Transit 0.2088605 -32.0
Michigan Workplace -0.2086979 -40.0
North Carolina Residential 0.2064664 13.0
Kansas Grocery_Pharmacy -0.2000217 -14.0
New Jersey Residential 0.1997727 18.0
West Virginia Workplace 0.1965556 -33.0
Oregon Residential 0.1952058 10.5
Illinois Residential 0.1950067 14.0
Mississippi Grocery_Pharmacy -0.1946849 -8.0
Iowa Transit 0.1895382 -24.0
Georgia Workplace -0.1894082 -33.5
North Dakota Grocery_Pharmacy -0.1874335 -8.0
Missouri Workplace 0.1823783 -28.5
Colorado Parks -0.1815401 2.0
Virginia Residential 0.1757176 14.0
Georgia Retail_Recreation -0.1753267 -41.0
South Dakota Transit -0.1722290 -40.0
New Mexico Transit 0.1720286 -38.5
Wisconsin Residential -0.1661776 14.0
Florida Retail_Recreation 0.1657178 -43.0
Connecticut Parks 0.1648305 43.0
Ohio Transit 0.1616932 -28.0
Virginia Grocery_Pharmacy -0.1588196 -8.0
Washington Residential 0.1577082 13.0
Pennsylvania Transit -0.1524817 -42.0
Georgia Residential -0.1495106 13.0
Minnesota Parks 0.1490956 -9.0
Indiana Retail_Recreation 0.1490025 -38.0
South Carolina Residential -0.1480626 12.0
New Hampshire Retail_Recreation -0.1449328 -41.0
Oklahoma Residential 0.1446754 15.0
Virginia Parks 0.1425309 6.0
Massachusetts Parks 0.1406106 39.0
Indiana Residential 0.1399632 12.0
South Dakota Retail_Recreation -0.1381474 -38.5
Mississippi Transit -0.1380060 -38.5
Wisconsin Workplace -0.1376444 -31.0
Washington Grocery_Pharmacy 0.1368602 -7.0
North Dakota Transit 0.1362997 -48.0
Michigan Retail_Recreation -0.1334368 -53.0
South Dakota Residential 0.1333668 15.0
Alabama Parks 0.1330283 -1.0
Wyoming Retail_Recreation -0.1322402 -39.0
Massachusetts Transit -0.1321623 -45.0
Kentucky Grocery_Pharmacy 0.1244041 4.0
Ohio Parks -0.1237620 67.5
Alabama Retail_Recreation 0.1225953 -39.0
Maine Residential -0.1214394 11.0
Washington Transit -0.1203021 -33.5
Oregon Retail_Recreation 0.1178742 -40.5
North Carolina Parks -0.1175254 7.0
Texas Transit 0.1156979 -41.0
Texas Grocery_Pharmacy 0.1154046 -14.0
New Hampshire Grocery_Pharmacy -0.1137478 -6.0
Kansas Transit -0.1127226 -26.5
Mississippi Retail_Recreation -0.1120810 -40.0
Massachusetts Residential 0.1109768 15.0
Oklahoma Parks -0.1108734 -18.5
Florida Workplace -0.1107605 -33.0
Minnesota Retail_Recreation 0.1106100 -40.0
Minnesota Workplace -0.1094673 -33.0
Maryland Transit -0.1079063 -39.0
Arkansas Workplace -0.1078327 -26.0
Oregon Parks 0.1072908 16.5
South Dakota Workplace 0.1067270 -35.0
Nebraska Retail_Recreation 0.1056267 -36.0
Wyoming Residential 0.1046233 12.5
Ohio Residential 0.1031352 14.0
Arkansas Transit 0.1029893 -27.0
Idaho Parks 0.1029357 -22.0
Indiana Parks -0.0982284 29.0
Wisconsin Grocery_Pharmacy 0.0968551 -1.0
Arizona Workplace -0.0960182 -35.0
Maine Grocery_Pharmacy -0.0946027 -13.0
Georgia Parks 0.0936513 -6.0
New York Residential 0.0925785 17.5
Oklahoma Grocery_Pharmacy -0.0918267 -1.0
Indiana Workplace 0.0906754 -34.0
Nebraska Transit -0.0904027 -9.0
Pennsylvania Residential 0.0901232 15.0
Virginia Workplace -0.0900045 -31.5
Mississippi Workplace -0.0884230 -33.0
Missouri Transit -0.0874351 -24.5
Oregon Workplace -0.0848137 -31.0
New Hampshire Transit -0.0816634 -57.0
Virginia Retail_Recreation -0.0810618 -35.0
Nevada Parks 0.0804016 -12.5
South Carolina Transit 0.0783478 -45.0
Michigan Residential 0.0780795 15.0
Kentucky Retail_Recreation 0.0772638 -29.0
Tennessee Parks -0.0758393 10.5
Indiana Grocery_Pharmacy -0.0752953 -5.5
Colorado Transit 0.0750912 -36.0
Michigan Grocery_Pharmacy -0.0746823 -11.0
Ohio Grocery_Pharmacy 0.0716032 0.0
Minnesota Residential -0.0715039 17.0
Kentucky Residential 0.0691744 12.0
Washington Parks 0.0673945 -3.5
West Virginia Residential -0.0672015 11.0
North Dakota Residential -0.0669668 17.0
Michigan Transit 0.0668881 -46.0
South Carolina Retail_Recreation -0.0648357 -35.0
Minnesota Grocery_Pharmacy 0.0635973 -6.0
North Carolina Retail_Recreation 0.0634871 -34.0
Ohio Retail_Recreation 0.0603142 -36.0
Oklahoma Workplace 0.0600869 -31.0
Washington Retail_Recreation -0.0550892 -42.0
Missouri Retail_Recreation -0.0531486 -36.0
South Carolina Grocery_Pharmacy 0.0526156 1.0
Missouri Parks 0.0509697 0.0
Florida Grocery_Pharmacy 0.0502493 -14.0
Kentucky Workplace -0.0497516 -36.0
New Hampshire Workplace 0.0477427 -37.0
West Virginia Transit -0.0477178 -45.0
Missouri Grocery_Pharmacy 0.0469529 2.0
Iowa Grocery_Pharmacy -0.0451845 4.0
South Dakota Grocery_Pharmacy 0.0441891 -9.0
Illinois Grocery_Pharmacy -0.0440824 2.0
Oregon Transit 0.0436747 -27.5
Indiana Transit 0.0406368 -29.0
Kentucky Transit 0.0402077 -31.0
Nebraska Parks 0.0389615 55.5
Florida Transit -0.0383784 -49.0
Texas Retail_Recreation 0.0375962 -40.0
Illinois Retail_Recreation 0.0372380 -40.0
Colorado Grocery_Pharmacy -0.0357304 -17.0
Arizona Parks -0.0352708 -44.5
Colorado Retail_Recreation -0.0329526 -44.0
Mississippi Parks -0.0312567 -25.0
Nevada Workplace 0.0309636 -40.0
Ohio Workplace -0.0293688 -35.0
Iowa Workplace 0.0290408 -30.0
Alabama Residential -0.0278429 11.0
Oklahoma Retail_Recreation 0.0246989 -31.0
Tennessee Grocery_Pharmacy 0.0245776 6.0
Iowa Retail_Recreation -0.0239535 -38.0
Utah Grocery_Pharmacy 0.0234731 -4.0
Tennessee Transit -0.0194372 -32.0
Kansas Residential -0.0186333 13.0
New Mexico Workplace 0.0183149 -34.0
Iowa Residential -0.0149262 13.0
Kansas Retail_Recreation -0.0135407 -37.0
Georgia Transit -0.0091876 -35.0
Oklahoma Transit 0.0067179 -26.0
Colorado Workplace 0.0064816 -39.0
Maryland Parks -0.0062705 27.0
Wisconsin Retail_Recreation 0.0061743 -44.0
Tennessee Retail_Recreation -0.0056357 -30.0
Nevada Grocery_Pharmacy 0.0051934 -12.5
West Virginia Retail_Recreation 0.0050039 -38.5
Vermont Transit 0.0010671 -63.0
Arkansas Grocery_Pharmacy 0.0007220 3.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
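A caveat on the table above: several states show correlations of exactly ±1 or NA. With only two county-level observations, Pearson correlation is always exactly ±1 regardless of any real relationship, and with a single observation (or zero variance) it is undefined — the likely explanation for states with few reporting counties (e.g. Delaware, District of Columbia). A quick base-R illustration:

```r
# Two points always correlate perfectly, whatever the underlying relationship
cor(c(1, 2), c(10, 30))   # 1
cor(c(1, 2), c(30, 10))   # -1
```

So the extremes at the top and bottom of the ranking say more about sample size than about mobility.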
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
|---|---|---|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still log-linear). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range that depends on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
|---|---|
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the better correlations, but the results are a mixed bag. |
| What is the trend in cases and mortality across geographical regions? | Total confirmed cases and mortality are overall log-linear for most countries, with a trailing-off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |
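The mortality rate quoted above is simply cumulative deaths divided by cumulative confirmed cases. A minimal sketch of that calculation (the counts below are made up for illustration, and the column names only mirror the `Total_confirmed_cases` style used in this analysis):

```r
library(dplyr)

# Hypothetical cumulative counts for two dates (illustrative only)
cases <- data.frame(
  Date = as.Date(c("2020-04-01", "2020-04-02")),
  Total_confirmed_cases = c(200000, 215000),
  Total_deaths = c(6000, 6600)
)

# Crude mortality rate: cumulative deaths / cumulative confirmed cases
cases <- mutate(cases, mortality_rate = Total_deaths / Total_confirmed_cases)
cases$mortality_rate  # ~0.030 for both dates, i.e. the ~3% quoted above
```

Note this is a naive estimate: it undercounts true infections (testing-limited denominators) and ignores the lag between confirmation and death, both of which the limitations section flags.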

END

End: ##—— Thu May 28 20:58:54 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
install.packages("maps")
library(maps)
library(dplyr)  # for %>% and select()
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases<-1:2226
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()
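For reference, a minimal working version of the choropleth these attempts are circling around. The key fix is mapping `group = group` (the polygon identifier returned by `map_data()`) so each state outline closes correctly, and joining the case counts onto the polygon vertices before plotting. The state names and counts below are made up for illustration:

```r
library(ggplot2)
library(dplyr)
library(maps)

# Hypothetical per-state case counts (illustrative only)
state_cases <- data.frame(
  region = c("pennsylvania", "new york", "maryland"),
  Total_confirmed_cases = c(60000, 350000, 40000)
)

# map_data() returns one row per polygon vertex; joining on the
# lowercase state name attaches a fill value to every vertex
us_states <- map_data("state") %>%
  inner_join(state_cases, by = "region")

ggplot(us_states, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") +
  scale_fill_viridis_c(option = "C") +
  coord_quickmap()
```

Filling by a raw Lat/Long point table (as attempted above with `Corona_Cases.US_state`) will not produce state shapes, because `geom_polygon` needs ordered boundary vertices, not case locations.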


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf before using sf verbs

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net